Note: Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class, including choosing appropriate parameters for all graphs. Do not expect the assignment questions to spell out precisely how the graphs should be drawn. Sometimes guidance will be provided, but the absense of guidance does not mean that all choices are ok.

Read Graphical Data Analysis with R, Chap. 6, 7

1. Crime

Data source: https://data.ny.gov/Public-Safety/Index-Crimes-by-County-and-Agency-Beginning-1990/ca8h-8gjq

You do not need to submit the data with your assignment. You may either download and read from your local copy or read directly from the web site with df <- read_csv("https://data.ny.gov/api/views/ca8h-8gjq/rows.csv").

  1. Create a parallel coordinates plot showing the number of crimes in each of the categories listed for 2020. Show actual counts; do not rescale. Your graph should have one line for each county in New York State. (Use GGally::ggparcoord())
library(tidyverse)
library(GGally)
df <- read_csv("https://data.ny.gov/api/views/ca8h-8gjq/rows.csv")

df2 <- df %>%
  group_by(County, Year) %>%
  summarize(Murder = sum(Murder), Rape = sum(Rape), Robbery = sum(Robbery),
            Assault = sum(`Aggravated Assault`), Burglary = sum(Burglary),
            Larceny = sum(Larceny), `Motor Vehicle Theft` = sum(`Motor Vehicle Theft`),
            Region = max(Region)) %>%
  ungroup() %>%
  filter(County != "Region Total") %>%
  filter(Year == 2020) %>%
  arrange(desc(Region))

ggparcoord(df2, columns = 3:9, scale = "globalminmax")

  1. Now experiment with alpha blending, splines, and rescaling to create the clearest version of the plot that you can. What patterns do you observe? (Use GGally::ggparcoord())
ggparcoord(df2, columns = 3:9, alphaLines = .1, splineFactor = 7)

We can see that there are two groups of counties - the majority of New York state counties have a similar, lower number of robberies and assaults, and the few outliers with very high values of every number of crime.

  1. Create an interactive parallel coordinates plot of the same data, coloring by Region. Discuss outliers, clusters, and correlations in detail.
# Unfortunately there is no way to reorder the counties
# see: https://github.com/timelyportfolio/parcoords/issues/16
library(parcoords)
  df2 %>% select(-Year) %>% 
    parcoords(rownames = F 
    , brushMode = "1D-axes"
    , reorderable = T
    , queue = T
    , color = list(
      colorBy = "Region"
      ,colorScale = "scaleOrdinal"
      ,colorScheme = "schemeCategory10"
      )
    , withD3 = TRUE
    )   

For the most part, the non-New York City counties have a lower number of crimes, whereas all but one New York city counties have much higher crime. There is one main outlier in each region - Erie (non-New York City) has overall high crime, and Richmond (New York City) has extremely low crime. Monroe and Suffolk (non-NYC) also have a fairly high crime rate overall. Overall, there seems to be a positive correlation between all the crimes (e.g. a county with a higher number of robberies also has a higher number of rapes).

2. Sleep

Data: SleepStudy from Lock5withR package

Draw the following graphs and answer the questions.

  1. Is there an association between ClassYear and AnxietyStatus? Between ClassYear and NumEarlyClass? Justify your answers with mosaic plots.
library(vcd)
library(Lock5withR)
SleepStudy$AnxietyStatus <- factor(SleepStudy$AnxietyStatus, 
                                   levels = c("normal", "moderate", "severe"))
mosaic(AnxietyStatus ~ ClassYear,
       data = SleepStudy,
       direction = c("v", "h"),
       highlighting_fill = RColorBrewer::brewer.pal(3, "Blues"),
       spacing = spacing_equal(sp = unit(0, "lines")),
       labeling = labeling_border(rot_labels = c(0, 0, 0, 0),
                                  set_varnames = c(ClassYear = "Class Year", AnxietyStatus = "Anxiety\nStatus")))

mosaic(NumEarlyClass ~ ClassYear,
       data = SleepStudy,
       direction = c("v", "h"),
       highlighting_fill = RColorBrewer::brewer.pal(6, "Greens"),
       spacing = spacing_equal(sp = unit(0, "lines")),
       labeling = labeling_border(rot_labels = c(0, 0, 0, 0),
                                  set_varnames = c(ClassYear = "Class Year", NumEarlyClass = "# of Early Classes")))

For Class years 1 and 2, there is no association between class year and anxiety status. There is a slight assocation for class years 3 and 4, where class year 4 has more moderately anxious people, but overall there is not a significant assocation.

There is a clear assocation between class year and number of early classes. As class year increases, the number of students who take early classes decreases.

  1. Perform chi square tests to test for associations between the sets of variables graphed in part a). What are the results? Discuss in relation to the mosaic plots.
chisq.test(SleepStudy$ClassYear, SleepStudy$AnxietyStatus)
## 
##  Pearson's Chi-squared test
## 
## data:  SleepStudy$ClassYear and SleepStudy$AnxietyStatus
## X-squared = 4.6165, df = 6, p-value = 0.5938
chisq.test(SleepStudy$ClassYear, SleepStudy$NumEarlyClass)
## 
##  Pearson's Chi-squared test
## 
## data:  SleepStudy$ClassYear and SleepStudy$NumEarlyClass
## X-squared = 54.952, df = 15, p-value = 1.819e-06

The p-value for the assocation between class year and anxiety status is 0.59, which means we fail to reject the null hypothesis that there is not an assocation between the two variables. This agrees with our analysis in part a, where the mosaic plots visually showed no assocation. The p-value for the assocation between class year and number of early classes is essentially 0, therefore we reject the null hypothesis that there is no assocation. This also agrees with our visual analysis with the mosaic plots, where we saw a negative correlation between number of early classes and class year.

  1. How is the relationship between anxiety status and number of early classes affected by class year? Create a mosaic plot showing all three variables, treating anxiety status as the dependent variable. Discuss the results.
mosaic(AnxietyStatus ~ NumEarlyClass+ClassYear,
       data = SleepStudy,
       direction = c("v", "v", "h"),
       highlighting_fill = RColorBrewer::brewer.pal(3, "Blues"),
       spacing = vcd::spacing_dimequal(c(.3, 0, 0)),
       labeling = labeling_border(rot_labels = c(0, 0, 0, 0),
        set_varnames = c(ClassYear = "Class Year",
                         AnxietyStatus = "Anxiety\nStatus",
                         NumEarlyClass = "# of Early Classes")))

Since some of the spines are very narrow, it is helpful to consider a same bin width mosaic plot, which is essentionally a faceted stacked bar chart based on proportions:

sleep <- SleepStudy %>%
  group_by(ClassYear, NumEarlyClass, AnxietyStatus) %>%
  summarize(n = n()) %>%
  mutate(prop = n/sum(n))

sleep %>%
  ggplot(aes(x = ClassYear, y = prop, fill = AnxietyStatus)) +
  geom_col() +
  facet_wrap(~NumEarlyClass) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(3, "Blues")) +
  ggtitle("Anxiety Status by Class Year", sub = "faceted on number of early classes") +
  theme_classic()

The results are perhaps counterintuitive: there is a clear pattern of severe anxiety in students who take 0 or 2 early classes. (One might have expected more anxiety with a higher number of early classes.) We need to be cautious about this conclusion since the sample sizes for 1, 3, 4, and 5 early classes are so small:

SleepStudy %>% group_by(NumEarlyClass) %>% count
## # A tibble: 6 × 2
## # Groups:   NumEarlyClass [6]
##   NumEarlyClass     n
##           <int> <int>
## 1             0    85
## 2             1    14
## 3             2    88
## 4             3    35
## 5             4    11
## 6             5    20
  1. Use the base pairs() function to draw a mosaic pairs plot of all categorical (factor) variables in SleepStudy. (Note: The vcd package must be loaded for pairs() to find the correct method.) Name a pair of variables which appear to have a very strong association. Name a pair of variables which appear not to be associated.
# fig.width=8 and fig.height=8 are included in chunk options

Lock5withR::SleepStudy %>%
  dplyr::select(c(where(is.factor), allNighter, NumEarlyClass, ClassYear)) %>%
  table() %>%
  pairs(space = .2,
        lower_panel = pairs_mosaic(highlighting = 2, spacing = spacing_equal(0)),
        upper_panel = pairs_mosaic(spacing = spacing_equal(0)),
        diag_panel = pairs_barplot(
          gp_vartext = gpar(fontsize = 12),
          gp_leveltext = gpar(fontsize = 8),
          abbreviate = 2))

some examples of strong assocation variables: (depression status, anxiety status), (sex, class year), (depression status, sex) some examples of weak/no assocition variables: (sex, early class), (lark owl, depression status), (alcohol use, early class)

3. Wait List

The file stats_wl.csv contains information about waitlist movement for a Fall 2021 Columbia U undergraduate statistics class.

There are 640 rows and 4 variables:

Name name of student (actual names were replaced with names generated from the randomNames package)

Date since SSOL updates overnight, waitlist positions were collected each morning during the change of program period

Priority position in waitlist, for example 1 = top position on list

Status final outcome, Registered = received a place in class and remained; Dropped Class = received a place in class and left; Left List = left waiting list; Joined = remained on waiting list at the end of the change of program period. (Note that the status reflects what ultimately happened, not what the status was on a particular date.)

Create an alluvial diagram that shows waitlist movement during the change of program period. It is not necessary to include the Name column in the diagram, but it should be possible to observe movement of individual students: for example, that the student who was 22nd in the waitlist on Sept 9th moved up to 15th place on Sept 16th and then left the list.

library(tidyverse)
library(ggalluvial)
df <- read_csv("stats_wl.csv") %>% 
  mutate(Status = fct_relevel(Status, "Registered", "Dropped Class", "Left List", "Joined"))


ggplot(df, aes(x = Date, y = 1, stratum = fct_rev(factor(Priority)), alluvium = Name)) +
  geom_alluvium(aes(fill = Status), color = "grey50") +
  geom_stratum(aes(fill = Status), color = "grey50") +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 2, color = "grey40") +
  scale_y_reverse(breaks = NULL) +
  xlab("") +
  ylab("Position in waitlist") +
  theme_bw() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())

# Alternative solution
g <- list()
for(i in 1:4) {
 g[[i]] <- ggplot(df, aes(x = Date, y = 1, stratum = fct_rev(factor(Priority)), alluvium = Name)) +
  geom_alluvium(aes(fill = Status, alpha = Status), color = "grey50") +
  scale_alpha_manual(values = c(rep(.2, i-1), 1, rep(.2, 4-i))) +
  scale_y_reverse(breaks = seq(1, 56, 5) -.5, labels = seq(1, 56, 5)) +
  scale_x_date(date_breaks = "1 day", date_labels = "%d") +
  ggtitle(levels(df$Status)[i]) +
  xlab("Day (September 2021)") +
  ylab("Position in waitlist") +
  guides(fill = "none", alpha = "none") +
  theme_bw() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())
}

gridExtra::grid.arrange(g[[1]], g[[2]], g[[3]], g[[4]], ncol = 2)